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1 Introduction 


Our goal is to introduce statisticians to the large body of literature on proximal algo¬ 
rithms for solving optimization problems that arise within statistics. By a proximal 
algorithm, we mean an algorithm whose steps involve evaluating the proximal oper¬ 
ator of some term in the objective function. Both of these concepts will be dehned 
precisely in the next section. The canonical optimization problem of minimising a 
measure of £t, together with a regularization penalty, sits at the heart of modern 
statistical practice and it arises, for example, in sparse regression [Tibshirani, 1996], 
spatial smoothing [Tibshirani et ah, 2005], covariance estimation [Witten et al., 2009], 
image processing [Geman and Reynolds, 1992, Geman and Yang, 1995, Rudin et ah, 
1992], nonlinear curve htting [Tibshirani, 2014], Bayesian MAP inference [Poison and 
Scott, 2012], multiple hypothesis testing [Tansey et ah, 2014] and shrinkage/sparsity- 
inducing prior regularisation problems [Green et al., 2015]. For recent surveys on 
proximal algorithms, see [Cevher et al., 2014, Komodakis and Pesquet, 2014, Gom- 
bettes and Pesquet, 2011, Boyd et ah, 2011]. 

The techniques we employ here are often referred to as Proximal Gradient, Proxi¬ 
mal Point, Alternating Direction Method of Multipliers (ADMM) [Boyd et al., 2011], 
Divide and Goncur (DG), Frank-Wolfe (FW), Douglas-Ratchford (DR) splitting or al¬ 
ternating split Bregman (ASB) methods. The held of image processing has developed 
many of these ideas in the form of Total Variation (TV) de-noising and half-quadratic 
(HQ) optimization [Geman and Yang, 1995, Geman and Reynolds, 1992, Nikolova 
and Ng, 2005] . Other methods such as fast iterative shrinkage thresholding algorithm 
(FISTA), expectation maximization (EM), majorisation-minimisation (MM) and it¬ 
eratively reweighed least squares (IRLS) fall into our proximal framework. Although 
such approaches are commonplace in statistics and machine learning [Bien et al., 
2013], there hasn’t been a real focus on the general family of approaches that underly 
these algorithms. Early work on iterative proximal fixed point algorithms in Banach 
spaces is due to [Von Neumann, 1951, Bregman, 1967, Hestenes, 1969, Martinet, 1970, 
Rockafellar, 1976]. 

A useful feature of proximal algorithms are acceleration techniques [Nesterov, 
1983] which lead to non-descent algorithms that can provide an order-of-magnitude 
increase in efficiency. When both functions are convex, and one has a smooth Lipschitz 
continuous gradient, a simple convergence result based on the reverse Pythagoras in¬ 
equality is available. Gonvergence rates of the associated gradient descent algorithms 
can vary and typically each analysis has to be dealt with on a case-by-case basis. We 
illustrate acceleration for a sparse logistic regression with a fused lasso penalty. 

The rest of the paper proceeds as follows. Section 1.1 provides notation and ba¬ 
sic properties of proximal operators and envelopes. Section 2 describes the proximal 
operator and Moreau envelope. Section 3.1 describes the basic proximal algorithms 
and their extensions. Section 4 describes common algorithms and techniques, such as 
ADMM and Divide and Goncur, that rely on proximal algorithms. Section 5 discusses 
envelopes and how proximal algorithms can be viewed as envelope gradients. Sec¬ 
tion 6 considers the general problem of composite operator optimisation and shows 
how to compute the exact proximal operator with a general quadratic envelope and a 


2 


composite regularisation penalty. Section 7 illustrates the methodology with applica¬ 
tions to logistic and Poisson regression with fused lasso penalties. A bridge regression 
penalty illustrates the non-convex case and we apply our algorithm to the prostate 
data of Hastie et ah [2009]. 

Table 1 provides commonly used proximal operators, Table 2 documents examples 
of half-quadratic envelopes and Table 3 lists convergence rates for a variety of algo¬ 
rithms. Appendix A discusses convergence results for both convex and non-convex 
cases together with Nesterov acceleration. Finally, Section 8 concludes with directions 
for future research. 

1.1 Preliminaries 

Many optimization problems in statistics take the following form 

axgmmF^x) : = l{x) + (j){x) ( 1 ) 

where l{x) is a measure of £t depending implicitly on some observed data y, 0 (x) 
is a regularization term that imposes structure or effects a favorable bias-variance 
trade-off. Typically, l{x) is a smooth function and 4>{x) is non-smoothMike a lasso or 
bridge penalty-so as to induce sparsity. We will assume that I and 0 are convex and 
lower semi-continuous except when explicitly stated to be non-convex. 

We use X = {xi,... Xd) to denote a d-dimensional parameter of interest, y an n- 
vector of outcomes, A a fixed n x d matrix whose rows are covariates (or features) 
aj, and B a hxed k x d matrix to encode some structural penalty on the parameter 
(as in the group lasso or fused lasso), h are prior loadings and centerings and 7 > 0 is 
a regularisation parameter that will trace out a solution path. All together, we have 
a composite objective of the form 

n k 

Avh afx) +-f^(j){[Bx - b]j) (2) 

i=l j=l 

For example, lasso can be viewed as a simple statistical model with the negative 
log likelihood from y = Ax + e, where e is a standard normal measurement error, 
corresponding to the norm l{x) = ||Ax — j/||^, and each parameter Xj has independent 
Laplace priors corresponding to the regularisation penalty (f){x) = \^j\- 

Throughout, observations will be indexed by i, parameters by j, and iterations of 
an algorithm by t. Unless stated otherwise, all functions are lower semi-continuous 
and convex (e.g. l{x), 0(x)), and all vectors are column vectors. We will pay particular 
attention to composite penalties of the form (j){Bx), where i? is a matrix corresponding 
to some constraint space, such as the discrete difference operator in fused Lasso. 

The following concepts and dehnitions will be useful: 

Splitting is a key tool that exploits an equivalence between the unconstrained 
optimisation problem and a constrained one that includes a latent-or slack-variable, 
where we write 

min {/(x) -1- 0(Ax)} = min {/(x) -1- (j){z)} subject to z = Ax . 

X x,z 
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Envelopes are another way of introducing latent variables. For example, we will 
assume that the objective l{x) can take one of two forms of an envelope; 

1 . a linear envelope l{x) = sup^ {xz — E{z)} where I* denotes the convex dual. 

2. a quadratic envelope l{x) = inf^ \J^x'^K{z)x — Tf'{z)x + for some A,ri,'ip. 

The convex conjugate of l{x), l*{z), is the point-wise supremum of a family of 
affine (and therefore convex) functions in z] it is convex even when l{x) is not. But 
if l{x) is convex (and closed and proper), then the following dual relationship holds 
between I and its conjugate: 

l{x) = supjA^a; — /*(A)} where /*(A) = sup{A^a; — /(x)} . 

A X 

If l{x) is differentiable, the maximizing value of A is A(a;) = Vl{x). 

A function g{x) is said to majorize another function f\x) at xq if g{xo) = /(xq) 
and g{x) > /(x) for all x 7 ^ xq. If the same relation holds with the inequality sign 
flipped, g{x) is said to be a minorizing function for /(x). A p-strong convex function 
satishes 

f{x) > f{z) + u~^{x — z) + ^\\x — z\\\, where u G df{z) 

and d denotes the subdifferential operator dehned by 

df{x) = {x : f{z) > /(x) -f v'^{z — x),\/z,x G dom(/)} . 

A p-smooth function satishes 

f{x) < f{z) + V/(z)^(x -z) + |i|x - z\\l,^x,z . 

We also use the following conventions: sgn(x) is the algebraic sign of x, and 
x+ = max(x,0); lc{x) is the set indicator function taking the value 0 if x G C, and 
cx) if X ^ C; M"*" = [0, 00 ), IR++ = (0, cx)), and M is the extended real line MUj— 00 , cx)}. 

2 Proximal operators and Moreau envelopes 

The key tools we employ are proximal operators and Moreau envelopes. Let /(x) be 
a lower semi-continuous function, and let 7 > 0 be a scalar. The Moreau envelope 
f'^{x) and proximal operator prox.^j(x) with parameter 7 are dehned as 

P{x) = inf |/(z) + ^\\z - x||^| < /(x) (3) 

prox(x) = argmin 
7/ ^ 

Intuitively, the Moreau envelope is a regularized version of /. It approximates / 
from below and has the same set of minimizing values [Rockafellar and Wets, 1998, 
Chapter IG]. The proximal operator specihes the value that solves the minimization 
problem dehned by the Moreau envelope. It balances the two goals of minimizing / 
and staying near x, with 7 controlling the trade-oh. Table 1 provides an extensive 
list of closed-form solutions. 
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2.1 Properties of Proximal Operators 

Our perspective throughout this paper will be to view proximal hxed point algorithm 
as the gradient of a suitably dehned envelope function. By constructing different en¬ 
velopes one can develop new optimisation algorithms. We build up to this perspective 
by hrst discussing the basic properties of the proximal operator and its relationship 
to the gradient of the standard Moreau envelope. For further information, see Parikh 
and Boyd [2013] who provide interesting interpretations of the proximal operator. 
Each one provides some intuition about why proximal operators might be useful in 
optimization. We highlight three of these interpretations here that relate to the en¬ 
velope perspective. 

First, the proximal operator behaves similarly to a gradient-descent step for the 
function /. There are many ways of motivating this connection, but one simple way 
is to consider the Moreau envelope which approximates / from below. Observe 

that the Moreau derivative is 

dPix) = a inf \f{z) + 7 ^ 11 ^-xlla) = \x - z{x)] 

z y 27 J 7 

where z{x) = pTox^j{x) is the value that achieves the minimum. Hence, 

prox(a:) = x — 'ydp^x ), 

7/ 

Thus, evaluating the proximal operator can be viewed as a gradient-descent step for 
a regularized version of the original function, with 7 as a step-size parameter. 

Second, the proximal operator generalizes the notion of the Euclidean projection. 
To see this, consider the special case where f{x) = ic{,x) is the set indicator function 
of some convex set C. Then proxj(a;) = argmin^g^ ||x — z\\\ is the ordinary Euclidean 
projection of x onto C. This suggests that, for other functions, the proximal operator 
can be thought of as a generalized projection. A constrained optimization problem 
f{x) has an equivalent solution as an unconstrained proximal operator prob¬ 
lem. Proximal approaches are, therefore, directly related to convex relaxation and 
quadratic majorization, through the addition of terms like |||a; — n|p to an objective 
function-where p might be a constant that bounds an operator or the Hessian of a 
function. We can choose where these quadratic terms are introduced, which variables 
the terms can involve, and the order in which optimization steps are taken. The 
envelope framework highlights such choices, leading to many distinct and familiar 
algorithms. 

There is a close connection between proximal operators and hxed-point theory, 
in that prox..yj(x*) = x* if and only if x* is a minimizing value of /(x). To see this 
informally, consider the proximal minimization algorithm, in which we start from 
some point Xq and repeatedly apply the proximal operator: 

x*^^ = prox(x*) = X* — 7 V/'’'(x*). 

7/ 

At convergence, we reach a minimum point x* of the Moreau envelope, and thus a 
minimum of the original function. At this minimizing value, we have Vf'^{x*) = 0 
and thus prox.^j(x*) = x*. 
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Finally, another key property of proximal operators is the Moreau decomposition 
for the proximal operator of /*, the dual of /: 

X = prox(a;) + A prox(Aa;) 

A/ /*A 

/ — prox(a;) = Aprox(Ax) (4) 

A/ /*/A 

The Moreau identity allows one to easily alter steps within a proximal algorithm so 
that some computations are performed in the dual (or primal) space. Applications of 
this identity can also succinctly explain the relationship between a number of different 
optimization algorithms, as described in Section 6 . 

All three of these ideas—projecting points onto constraint regions, taking gradient- 
descent steps, and hnding hxed points of suitably dehned operators—arise routinely 
in many classical optimization algorithms. It is therefore easy to imagine that the 
proximal operator, which relates to all these ideas, could also prove useful. 

2.2 Simple examples of proximal operators 

Many intermediate steps in optimization problems can be written very compactly 
in terms of proximal operators of log likelihoods or penalty functions. Here are two 
examples. 

Figure 1 provides a graphical depiction of these two concepts for the simple case 
f{x) = |x|. In general the proximal operator may be set-valued, but it is scalar-valued 
in the special case where f{x) is a proper convex function. 

Example 1. Figure 1 shows a simple proximal operator and Moreau envelope. The 
solid black line shows the function f{x) = |x|, and the dotted line shows the corre¬ 
sponding Moreau envelope f^{x) with parameter 7 = 1 . The grey line shows the func¬ 
tion |x| -f {l/2){x — xqY for xq = 1.5, whose minimum (shown as a red cross) defines 
the Moreau envelope and proximal operator. This point has ordinate proxj(a;o) = 0.5 
and abscissa f^{xo) = 1, and is closer than xq to the overall minimum at x = 0. 
The blue circle shows the point {xq, f^{xo)), emphasizing the point-wise construction 
of the Moreau envelope in terms of a simple optimization problem. 

Let 0(x) = A||a:||i and consider the proximal operator pio^^^{x). In this case the 
proximal operator is clearly separable in the components of x, and the problem that 
must be solved for each component is 

min |a|z|-I--( z — a:)^| . 


This problem has solution 

z = prox(a;) = sgn(a;)(|x| - A/ 7 )+ = Sx/^{x ), (5) 

Akl/7 

the soft-thresholding operator with parameter A/ 7 . 
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Figure 1: A simple example of the proximal operator and Moreau envelope. 


Example 2. Quadratic terms of the form 

l{x) = -x^Px + q^x + r , 


( 6 ) 


are very common in statistics. They correspond to conditionally Gaussian sampling 
models and arise in weighted least sguares problems, in ridge regression, and in EM 
algorithms based on scale-mixtures of normals. For example, if we assume that {y\x) ~ 
N(Ax, then l{x) = {y — AxYVt{y — Ax)/2, or 


P = A'^ilA , q = —A^Vty , r = y'^Vty/2 


in the general form given above (6). If l{x) takes this form, its proximal operator 
(with parameter I/ 7 ) may be directly computed as 


prox(x) = (P + 7 /) ^{'yAi^x — q ), 


assuming the relevant inverse exists. 

General lesson: the proximal operator provides concise description of many itera¬ 
tive algorithms. Practically useful only if the proximal operator can be evaluated in 
closed form or at modest computational cost. 

3 Proximal Algorithms 

3.1 The Proximal Gradient Method 

One of the simplest proximal algorithms is the proximal-gradient method which pro¬ 
vides an important starting point for the more advanced techniques we describe in 
subsequent sections. 

Suppose as in ( 2 ) that the objective function is F{x) = I(x) (f^x), where l{x) 

is differentiable but 4>{x) is not. An archetypal case is that of a generalized linear 
model with a non-differentiable penalty designed to encourage sparsity. The proximal 
gradient method is well suited for such problems. It has only two basic steps which 
are iterated until convergence. 
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1) Gradient step. Define an intermediate point n* by taking a gradient step with 

respect to the differentiable term l{x): 

— 7 V/(a;*). 

2) Proximal operator step. Evaluate the proximal operator of the non-differentiable 

term 0 (x) at the intermediate point n*: 

x*+^ = prox(n*) = prox{x* - ■yVl^x^)} . (7) 

7 <^ 1<I> 

This can be motivated in at least two ways. 

As an MM algorithm. Suppose that l{x) has a Lipschitz-continuous gradient with 
modulus A;. This allows us to construct a majorizing function: whenever 7 G (0, l/A^], 
we have the majorization 

l{x) + (f){x) < l{xo) + {x - xo)'^'Vl{xo) + 7 ^||x - a;o ||2 + 4>{x), 

27 

with equality at x = Xq. Simple algebra shows that the optimum value of the right- 
hand side is 

X = argmin 1 0(x) H-— uWl r > where u = Xq — 7V/(xo) • 

X I 27 J 

This is nothing but the proximal operator of 0, evaluated at an intermediate gradient- 
descent step for l{x). 

The fact that we may write this method as an MM algorithm leads to the following 
basic convergence result. Suppose that 

1 . l{x) is convex with domain 

2. Vl{x) is Lipschitz continuous with modulus Xi, i.e. 

II V/(a:) - V/( 2;)||2 < Ai||a: - z \\2 Vx, z . 

3. 0 is closed and convex, ensuring that prox^^ makes sense. 

4. the optimal value is hnite and obtained at x*. 

If these conditions are met, than the proximal gradient method converges at rate 1/t 
with hxed step size 7 = l/A^ [Beck and Teboulle, 2009]. 


As the fixed point of a “forward-backward” operator. The proximal gradient 
method can also be interpreted as a means for finding the fixed point of a “forward- 
backward” operator derived from the standard optimality conditions from subdiffer¬ 
ential calculus. This has connections (not pursued here) with the forward-backward 
method for solving partial differentiable equations. A necessary and sufficient condi¬ 
tion that X* minimizes l{x) is that 

0 e d {l{x*) + (j){x*)} = Vl{x*) + d(j){x *), (8) 

the sum of a point and a set. We will use this fact to characterize x* as the fixed 
point of the following operator: 

X* = proxjx* — 7 V/(x*)} . (9) 

7 </> 

To see this, let I be the identity operator. Observe that the optimality condition (8) 
is equivalent to 


0 G 7 V/(x*) — X* + X* + 'jd(j){x*) 

X* — 7 V/(x*) e X* -f- 7 d 0 (x*) 

(/ - 7V/)x* e (/ + 7d0)x* 

X* = (/ + 7d0)-^(/-7V/)x* 

= prox(x* — 7 V/(x*)), 

70 

the composition of two operators. The final line appeals to the fact (see below) that 
the proximal operator is the resolvent of the subdifferential operator: prox.^^(x) = 
(/ -|- 7 ^ 0 )“^(x). Thus to find the solution, we repeatedly apply the operator having 
X* as a fixed point: 

x^~^^ = proxjx* — 7 *V/(x*)} . 

7‘0 

This is precisely the proximal gradient method. 

We now show that the proximal operator is the resolvent of the subdifferential 
operator. By definition, if z ^ {I + ■jdl)~^x, then 


X e (/ -I- 'ydl)z 
X & z + 'ydl{z) 

0 e -(z — x) + dlix) 

7 

Z — x\\\ + l{x) I . 

But for 0 to be in the subdifferential (with respect to z) of the function on the right- 
hand side it is necessary and sufficient for z to satisfy 




z = argmin 

u 


1 


U — X 


2 

2 



prox(x). 

'yl 
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Therefore z = prox^;(a;) if and only if z G It is interesting that (J+ycI/)”^ 

is single-valned and therefore a fnnction, even thongh dl is set-valned. 

The proximal framework also applies to some non-convex regnlarisation penalties, 
e.g. L'^-norm for 0 < g < 1, for which we provide an example in Section 7.4. 

3.2 Iterative Shrinkage Thresholding 

Consider the proximal gradient method applied to a qnadratic-form log-likelihood 
( 6 ), as in a weighted least-sqnares problem, with a penalty fnnction 0(a:). Then 
Vl{x) = A^QAx — A'^Qy, and the proximal gradient method becomes 

x^~^^ = prox{x* — 'y^A'^Q{Ax^ ~ v)} ■ 

A<t> 

This algorithm has been widely stndied nnder the name of 1ST, or iterative shrinkage 
thresholding [Figneiredo and Nowak, 2003]. Its primary compntational costs at each 
iteration are: (1) mnltiplying the cnrrent iterate x* by A, and ( 2 ) mnltiplying the 
residnal Ax^ — y hy A^Q. Typically the proximal operator for (j) will be simple to 
compnte, as in the case of a qnadratic or LCnorm/Lasso penalty, and will contribnte 
a negligible amonnt to the overall complexity of the algorithm. 

3.3 Proximal Newton 

Proximal gradient, or forward-backward splitting, is a generalisation of the classical 
gradient approaches. They only reqnire hrst-order information and their speed can 
be improved by nsing second order information, where the resnlting algorithms mimic 
qnasi-Newton procednres. To do this, notice that the qnadratic bonnd in (7), implied 
by the dehnition of the proximal operator, implements a linear approximation of /(x); 
however, one can, natnrally, nse higher order expansions to constrnct envelopes. If 
we let 

Fh{x, z) = l{z) + Vl{z)'^{x — z) + -{x — zYHz{x — z) 

Then we can calcnlate the proximal operators, 

prox( 2 :) = z — ( 7 “^/-I- ^Vl{z) (10) 

Fh 

Instead of directly nsing the Hessian, = V^l{z), approximations can be employed 
leading to qnasi-Newton approaches. The second-order bonnd, and approximations 
to the Hessian, are one way to interpret the half-qnadratic (HQ) approach, as well as 
introdnce qnasi-Newton methods into the proximal framework. 

Proximal Newton methods are even possible for some non-convex problems; as in 
[Chonzenonx et ah, 2014] and Appendix D. One advantage is that adding second-order 
derivative information can convexity some problems. 
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3.4 Nesterov Acceleration 

One advantage of proximal algorithms is that we can accelerate the seqnences within 
algorithms like (7) by introducing an intermediate step that adds a momentum term 
to the slack variable, z, before evaluating the forward and backwards steps, 

= x' + - i)(x* - 

x‘+^ = prox - 7 -V/(/+^)) 

with 9t = 2/{t + 1 ) and — 1) = {t — l)/{t + 2 ). 

When 0 is convex the proximal problem is strongly convex, and advanced accel¬ 
eration techniques can be used [Zhang et ah, 2010, Meng and Chen, 2011]. 


4 Related Algorithms: ADMM, Divide and Con¬ 
cur, Bregman Divergences 


Many common estimation approaches can be interpreted as proximal point or proxi¬ 
mal gradient methods. Much of the variation within these approaches is simply due 
to the exact objective problem upon which a proximal algorithm is being used. In this 
section, we describe how splitting and functional conjugacy results in new objective 
functions (or Lagrangians), which relate to some well-known algorithms. In Section 6 
we describe an overarching framework for the objective functions of these algorithms 
and describe how proximal operators, their properties and resulting algorithms are 
applied. 

Our original problem mina,/(x) -|- 0(x) is clearly equivalent to 


min l{z) + 0 (x) 

X 

subject to X — z = 0 , 


( 11 ) 


which we refer to as the “primal” problem. We have introduced z as a redundant 
parameter (or “slack variable”), and encoded a consensus requirement in the form of 
the affine constraint x — ^ = 0 . 

Other redundant parameterizations are certainly possible. For example, consider 
the case of an exponential-family model for outcome y with cumulant-generating 
function 'ip{z) and with natural parameter z: 

P{y) = Po{y) exp{|/z -^(z)}. 


In a generalized linear model, the natural parameter for outcome yt is a linear regres¬ 
sion on covariates, Zt = afx. In this case /(x) may be written as 

N 

Z(x) = /j(x) where /j(x) = 'ip(ajx) — yi{ajx ), 

i=l 
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up to an additive constant not depending on x. Now introduce slack variables Zi = 
ajx. This leads to the equivalent primal problem 

N 

min y^{^jJ{zi) - HiZi} + 0(x) 

i=l 

subject to Ax — z = 0 . 

These same optimization problems can arise when one considers scale-mixtures, or 
convex variational forms Palmer et ah [2005]. The connection is made explicit by the 
dual function for a density and its relationship with scale-mixture decompositions. 
For instance, one can obtain the following equality for appropriate densities p{x), q{z) 
and constants /x, k: 

- \ogp{x) = - sup log {pn{x; p -f k/z, z~^)q{z)) 

z>0 

= inf { ^ (x - /X - n/zf - log {Vzq{z)) | . 

where Pn{x', /x, ct^) is the density function for a normal distribution with mean /x and 
variance a^. The form resulting from this normal scale-mixture envelope is similar to 
the half-quadratic envelopes described in Section 5, and-more generally-the objective 
in (15). Poison and Scott [2014] describe these relationships in further detail. 

The advantage of such a variable-splitting approach is that now the £t and penalty 
terms are de-coupled in the objective function of the primal problem. A standard tac¬ 
tic for exploiting this fact is to write down and solve the dual problem corresponding 
to the original (primal) constrained problem. This is sometimes referred to as du- 
alization. Many well-known references exist on this topic Bertsekas [2011]. For this 
reason we focus on problem formulation and algorithms for solving (11), avoiding 
standard material on duality or optimality conditions. 

The latent /slack variables allow us to view the problem of mina, F (x) as one of a 
joint minimisation of min^i^^ F"{x, z) where the augmented can be easily minimi¬ 

sation in a conditional fashion. Such alternating minimisation or iterated conditional 
mode (ICM) [Besag, 1986, Csiszar and Tusnady, 1984] algorithms have a long history 
in statistics. The additional insight is that proximal operators allow the researcher 
to perform the alternating minimisation step for the non-smooth penalty, 0, in an 
elegant closed-form fashion. Moreover, Divide and Concur methods allow difficult 
high dimensional problems to be broken down into a collection of smaller tractable 
subproblems with the global solution being retrieved from the solutions to the sub¬ 
problems. 

The following is a quick survey of some approaches that utilize variable splitting 
and conjugacy. 

Dual Ascent We hrst start with the simple problem 

min /(x) subject to Ax = y 
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. We can solve this with a Lagrangian of the form 


L{x, z) = l{x) + z'^{Ax — y) = l{x) + {A^z)'^x — z'^y . 

The dual function is g{z) = mix L{x, z) = —l*{—AAz) — y'^z and the dual 
problem is max^ g{z). 

Let p* and d* be the optimal values of the primal and dual problems, respec¬ 
tively. Assuming that strong duality holds, the optimal values of the primal 
and dual problems are the same. Moreover, we may recover a primal-optimal 
point X* from a dual-optimal point 2 :* via 

X* = argminL(a;, z*) 0 G dxL{x*, z*). 

X 

The idea of dual ascent is to solve the dual problem using gradient ascent via 
Vg{z) = VzL{xz,z) , where Xz = argminL(a;, ; 2 ). 

X 

The second term is simply the residual for the constraint: VzL{x, z) = Ax — y. 
Therefore, dual ascent involves iterating two steps: 

x^^^ = argmin L(a;, 2 ;^) 

X 

^k+i ^ akiAx^^^ - y) 
for appropriate step size a^- 

Augmented Lagrangian Take the same problem as before, 

min l{x) subject to Ax = y 

X 

with Lagrangian L(a;, z) = l{x) + z'^{Ax — y). 

The augmented-Lagrangian approach (or method of multipliers) seeks to stabi¬ 
lize the intermediate steps by adding a ridge-like term to the Lagrangian: 

L^{x,z) = l{x) + z'^{Ax - y) + '^\\Ax-y\\l. 

One way of viewing this is as the standard Lagrangian for the equivalent problem 

min l{x) + ^\\Ax-y\\l 
subject to Ax = y , 

For any primal-feasible x, the new objective remains unchanged, and thus has 
the same minimum as the original problem. The dual function is g^{z) = 
\nix L^{x, z) which is differentiable and strongly convex under mild conditions. 
We can now use dual ascent for the modihed problem, iterating 

x^^^ = argmin {l{x) + z'^{Ax^ — u) + — y\\l\ 

X ^ Z J 
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^k+i = ak{Ax^"^^ - y). 

Thus the dual-variable update doesn’t change compared to standard dual as¬ 
cent. But the X update has a regularization term added to it, whose magnitude 
depends upon the tuning parameter 7. Notice that the step size 7 is used in 
the dual-update step. 

Scaled form. Now re-scale the dual variable with u = 7 “^;^. We can rewrite 
the augmented Lagrangian, with r = Ax — y, as 

L^{x,u) = l{x) +'yu'^{Ax-y) + '^\\Ax - y\\l 

7 / \ I 7 II I ||2 7 II ||2 

= AX) + ^\\r + u \\2 --\\u \\2 
This leads to the following dual-update formulas: 

= argmin | l{x) + -||Aa; — y + u^\\l\ 

3; I 2 ) 

^k+i = ^k (Ax^^'^ - y). 

Notice that the re-scaled dual variable is the running sum of the residuals = 
Ax^ — y from the primal constraint. This is handy because the formulas are 
often shorter when working with re-scaled dual variables. 


Bregman iteration. The augmented Lagrangian method for solving L^-norm/Lasso 
problems is called “Bregman iteration” in the compressed-sensing literature. 

Here the goal is to solve the “exact recovery” problem via basis pursuit: 

min ||a;||i 

X 

subject to Ax = y , 

where y is measured, x is the unknown signal, and H is a known “short and 
fat” matrix (meaning more coordinates of x than there are observations). 

The scaled-form augmented Lagrangian corresponding to this problem is 
L^{x,u) = ||a;||i -h '^\\Ax - y + u\\l - '^\\u\\l, 

with steps 

x^^^ = argmin |||a;||i -|- — Zk\\l\ 

= y + z’^- Ax^"^^ , 

where we have redehned = y — compared to the usual form of the dual 
update. Thus each intermediate step of Bregman iteration is like a lasso regres¬ 
sion problem. (In the compressed sensing literature, this algorithm is motivated 
a different way, by appealing to Bregman divergences. But it’s the same algo¬ 
rithm.) 
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ADMM Combining the ideas of variable splitting with the augmented Lagrangian 
one arrives at a method called ADMM (alternating-direction method of mul¬ 
tipliers) for solving the problem (11). The scaled-form augmented Lagrangian 
for this problem is 

L^{x, z, u) = l{z) + 0(x) + ^\\x- z + u\\l + |||m ||2 . 

ADMM is similar to Dual Ascent for this problem, except that we optimize 
the Lagrangian in x and z individually, rather than jointly, in each pass (hence 
“alternating direction”): 

z^^^ = argmin |/(2;^) -t- '^\\x^ 

x^^^ = argmin |0(x^) -f- 

a; f 2 

^k+l =yk ^k+1 _ ^k+l _ 

The hrst two steps are the proximal operators of l{x) and 0(a;), respectively. 

One way of interpreting ADMM is as a variant on the proximal gradient method 
for the dual problem corresponding to (11). As a result, Nesterov-type acceler¬ 
ation methods may also be applied, with extra care to regularity conditions (in 
particular, strong convexity of l{x)). 

4.1 Bregman divergence and exponential families 

Let d{x) be a strictly convex differentiable function with convex/Legendre dual b. 
The Bregman divergence from x to y induced by d is 

Dd{x, y) = d{x) - d{y) - d'{y){x -y)>t). 

This is the vertical distance between d{y) and the extrapolated “guess” for d{y) based 
on the tangent line at x. In the multivariate case everything carries through with 
gradients/planes replacing derivatives/lines. 

There is a unique Bregman divergence associated with every exponential family. 
It corresponds precisely to the relationship between the natural parameterization and 
the mean-value parameterization. Suppose that 

piy, S) = Po{y) exp{ye - b{6)} . 

The expected value of y, as a function of 6, is given in terms of the cumulant- 
generating function as y,{6) = b'{6). This is sometimes referred to as Tweedie’s 
formula [Robbins, 1964, Efron, 2011], and has come up repeatedly in a variety of 
different contexts. By the envelope formula, the maximizing value of /i in the second 
equation (as a function of 9) satisfies /i(6*) = b'{9). This lets us recognize the dual 
variable /i as the mean-value parameterization; that is, the natural and mean-value 
parameterizations form a Legendre pair. 


- z'^^^ + u’^Wl^ 
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Hence, we can write an exponential-family model as either: (1) in terms of the 
natural parameter 6 and the cumulant-generating function b, p{y, 6, 6); or (2) in terms 
of the mean-value parameter p and the Bregman divergence induced by the Legendre 
dual d = b*, p{y, p, d). 

Splitting on the mean-value parameter. Another use of variable splitting is 
when we write the model in terms of its mean-value parameterization: 

Pivu oc eyip{-Dd[yi,Pi{x)]} , 

where d = b* and pi{x) = Eiypx) is the expected value, given the parameter. 

Assuming we are still using the canonical link, we may now write the model in 
terms of a penalized Bregman divergence with split variables: 

N 

min Ddivi, Zi) + 0(a;) 

X.Z f ^ 

i=l 

subject to p{ajx) — Zi = 0 . 
where 4>{x) is the penalty function. 

Example 3 (Poisson regression). In a Poisson model y Pois(/ii), Pi = exp(6'i) for 
natural parameter 6i = ajx. The eumulant generating function is b{6) = exp(6'), and 
thus d{p) = p log p — p. After simplification, the divergence Dd{y, p) = p — y log p + 
{p — y). The optimization problem can then be split as 

N 

min y^fzi - Pi log^i) -f 0(a;) 

X^Z ' ^ 

i=l 

subject to aJx = log Zi . 

4.2 Divide and Concur 

Divide and Concur provides a general approach to hierarchical statistical models that 
require optimisation of a sum of J composite functions of the form 

J+i 

max yy lj{Ajx) + (f){Bx) 
i=i 

DC adds slack variables, Zj for j G [1,..., J -|- 1], to “divide” the problem together 
with equality constraints so that the solutions “concur”. We have the equivalent 
constrained optimization problem 

./+i 

max Ijjzj) under constraints Zj = AjX, zj+i = Bx. 

j=i 
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where /j+i = 0, Aj^i = B. This can be solved using an iterative proximal splitting 
algorithm (e.g. multiple ADMM, split Bregman). Specihcally, under ADMM [Parikh 
and Boyd, 2013] one hnds, with 

= prox(:r* — u^) 

XljoAj 

= u] + - x*+^ . 

Divide and Concur [Gravel and Elser, 2008] methods are a natural approach to big 
data problems as they break a hard high-dimensional problem into tractable, inde¬ 
pendently computable sub-problems via splitting and then End the global solution 
from the solutions to each sub-problem. 

5 Envelope Methods 

In this section we introduce different types of envelopes: the forward-backward en¬ 
velope (FBE), Douglas-Rachford envelope (DRE), and the half-quadratic (HQ) enve¬ 
lope, and Bregman divergence envelopes. Within this framework, new algorithms are 
generated as a gradient step of an envelope Section 6 dissects these envelopes, shows 
their relationship to Lagrangian approaches, and provides a framework within which 
they can be derived and extended. 

5.1 Forward-Backward Envelope 

Suppose that we have to minimise F = l-\-(j) where I is strongly convex and possesses 
a continuous gradient with Lipschitz constant A/ so that |V^/(a;)| < A;. The penalty 0 
is only assumed to be proper lower semi-continuous and convex. If we don’t have an 
“exact” quadratic envelope (see the discussion in 6 . 2 ), then we can argue as follows. 

First, we define the FBE, F™(x), which will possess some desirable properties 
(see Patrinos and Bemporad [2013]). 

F™(a;): =min < l{x) -\- Vl{x)^{v — x) -\- 0(n) + 7 ^ ||^^ — a; 

' w 27 

='W - 0|v/(i)||"+ (i - 7ViW) 

If we pick 7 G (0, A[“^), the matrix / — 7 V^/(a;) is symmetric and positive definite. 
The stationary points of the envelope F™{x) are the solutions x* of the original 
problem which satisfy x = pTox^^[x — 7 V/(a;)). This follows from the derivative 
information 

VF™(a:) = (/ — 'y'V^l{x))Gj{x) where G-f{x) = 'y~^{x — P-fx)) 
where 10-(a;) = prox..|,^(a: — 7 V/(a;)). 

With these definitions, we can establish the descent property for the FBE 

7™(i)<F(i)-a||G,(x)ii^ 
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f’(AW) < F™(^) - |(1 - 7A,) ||G,(a:)||^ , 

Hence for 7 G (0, A;“^) the envelope value always decreases on application of the 
proximal operator of 70 and we can determine the stationary points. See Appendix A 
for further details. 

5.2 Douglas-Rachford Envelope 

Mimicking the forward-backward approach, Patrinos et ah [2014] derive the Douglas- 
Rachford envelope (DRE) 

F^^{x) = P{x) - ^\\VP{x)\\l + <P^x- 2^VP{x)) 

= min < l{x*) + Vl{x*y{z — x*) + cj){z) - 1 - —— x*\\'^ 

z I 27 

where P is, again, the Moreau envelope of the function I and x* = prox..|,;(x). This 
can be interpreted as a backward-backward envelope and is a special case of a FEE 
evaluated at the proximal operator of 7 /, namely 

F^^{x) = F^^ (pTox{x)^ . 

Again the gradient of this envelope produces the following proximal algorithm (see 
Patrinos et ah [2014]) which converges to the solution to miuj. {l{x) + 4>{x)} given by 
the iterations 



= prox(a;*) 

■yl 

2 ;*+! = prox( 2 tc* - 
^t+1 = {z^ - w*) 

There are many ways to re-arrange the DR algorithm. For example, with an inter¬ 
mediate variable, v = w — x, we could equally well iterate 

= prox(a:* — n*) , = prox(ta^ -[- n*) , + (tc* — x*) . 

7/ 70 

5.3 Half-Quadratic Envelopes 

We now provide an illustration of a quasi-Newton algorithm within the class of 
Half-Quadratic (HQ) optimization problems [Geman and Yang, 1995, Geman and 
Reynolds, 1992]. This envelope applies to the commonly used L^-norm where l{x) = 
||Ax — y\\‘^, and can be used in conjunction with some non-convex 0. See Nikolova 
and Ng [2005] for convergence rates and comparisons of the different algorithms. 

The half-quadratic envelope (HQE) is dehned by 

F^‘^(x) = inf {Q(x, v) -f 0 (n)} 

V 
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where Q{x, v) = vx^ or {y — xY 


and the function, Q{x,v)^ is half-quadratic in the variable v. In the HQ framework, 
the term 'ipiy) is usually understood to be the convex conjugate of some function, e.g. 
'ipiy) = (f)*{x). 

Example 4. Suppose that we wish to minimise the functional 

1 -Y 

F{x) = - I \Ax — nY + 7 <h(x) where <h(a;) = ^ (j){{B^x — b)i) 

i=l 

and we’re given 0(a;) = F^^{x). Then we need to solve the joint criterion 

^ d d 

F{x,v) = -\\Ax - y\^ + . 

i=\ i=l 

where 6i = {B^x — b)i. There is an equivalence between gradient linearisation and 
quasi-Newton. These algorithms give the iterative mappings: 

= L{v{x^))~^A^y and = x^ — L{x^)~^VxF{x^), 

where L(x*) is a step size function. They are identical, with derivative information 


VxF{x) = A^Ax - A^y 7 ^ 




2 = 1 


11^*1 


= {A^A + -fB V{x)B^)x - A^y 
= L{v{x))x — A^y 


for V(x) = diag{v{\\6\\'^^Y) F{v{x)) = A^A -|- vB V{x)B'^. 

Here v{x) = 0'(x)/2x for Geman-Yang (GY) and v{x) = x — 0'(x) for Geman- 
Reynolds (GR). 


5.4 Bregman Divergence Envelopes 

Many statistical models, such as those generated by an exponential family distribu¬ 
tion, can be written in terms of a Bregman divergence. One is then faced with the 
joint minimisation of an objective function of the form D{x,v) + 4>{x) + ipiv). To 
minimise over (x, v) we can use an alternating Bregman projection method. To per¬ 
form the minimisation of v given x we can make use of the D-Moreau envelope which 
is dehned by 

0^(x) = inf {D{x, v) + (fiv)} 

V 

where D{x,v) is a Bregman divergence, D{x,v) > 0 and attains equality at x = x. 
The Bregman divergence has a three-point law of cosines triangle inequality, which 
helps to establish descent in proximal algorithms (see Appendix A). Many commonly 
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used EM and MM algorithms in statistics and variational Bayes models use envelopes 
of this type. 

The key insight is that the proximal operator generated by the D-Moreau enve¬ 
lope allows one to add non-smooth regularisation penalties to traditional exponential 
family models. In our applications, we illustrate this with logistic and Poisson regres¬ 
sion both of which can be interpreted as Bregman divergence measures of £t in the 
objective function. 

We now turn to the general case of a quadratic envelope with a composite regu¬ 
larization penalty. 

6 Proximal Algorithms for Composite Functions 

Building off the general objective in (1), we now consider the composite objective 
given by the optimisation 


min F{x) : = l{x) + (j){Bx) . 

X 

Composite mappings of the form, (j){Bx), arises in multi-dimensional statistical mod¬ 
els that account for structural constraints or correlations, making such terms both 
common and important consideration in the construction and estimation of a model. 
Therefore, any practical framework for estimating statistical models must be capable 
of addressing these mappings somewhat broadly. The methodology described here 
uses splitting, proximal operators and Moreau envelopes. We find that this combina¬ 
tion of tools can be used together easily, applies to a broad range of functions and 
underlies many state-of-the-art approaches that scale well in high dimension. 

We start by noting that many optimization approaches, including the ones in 
Section 4, can be summarized by listing the general forms of the objective func- 
tions/Lagrangians that result from splitting and duality: 


primal 
primal-dual 
split primal 
split dual 


F{x) = l{x) + (f){Bx) 

Fpd{x, z) = l(x) + z'^(Bx) - 
Fsp{x, w, z) = l{x) + 4>{w) + z^{Bx — w) 
Fsd{x, w, z) = l*{w) + (j)*{z) + x'^{-B'^z - w) 


The motivation for using the primal-dual and the split forms (see Esser et ah [2010]) 
lies in how they decouple 0 from B without affecting its solution to the primal problem 
min 3 ,F(a:). We refer to these re-formulations of the primal objective function, and 
their implied minimization/maximization requirements, as joint objective problems. 
The exact objective problems given above are by no means exhaustive and need not 
apply to only one function in the primal objective. 

As mentioned in Section 4, the split problems can be viewed as Lagrangian for¬ 
mulations that each arise separately from the dehnition of the convex conjugate or 
Fenchel dual, and relate to each other, in the general case, by the Max-Min inequality 
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[Boyd and Vandenberghe, 2009] 


sup inf F{q, v) < inf sup F{q, v) 

q q 

In the special case of closed proper convex functions, we have the following 

minF(a;) = minsupFp£)(a;, = maxminF 5 p(a;, w, 2 ;) = maxmin^ 5 x 5 ( 0 ;, w, 2 ;) , 

X X ^ Z X,W X z,w 

made possible for the dual problems by noting the equality when (j) is convex 

(j){Bx) = snp [z’^Bx — (j)*{z)] . 

Z 

In this case, Fsp{x,w,z) and Fpd{x,z) are equated by 
minF 5 p(a;, w, z) 

w>0 


The solutions x*,w*, and z* can also be the results of proximal operators. 

6.1 Proximal Solutions within Objective Problems 

Given an objective problem, one must specify the exact steps to solve the sub¬ 
problems within it, i.e. the problems in w and/or In some cases, closed forms 
solutions for the primal or dual functions (i.e. l{x),l*{w),(j){x),(j)*{z)) in some vari¬ 
ables might not be available, or computationally efficient; however, exact solutions to 
related problems that share the same critical points may be easily accessible. These 
related problems, or the entire objective problem, can take the form of the envelopes 
in Section 5 and, as a result, the solutions for terms in the objective can be proximal 
operators. In fact, the envelope representation can be seen as a way to represent- 
altogether-the combination of an objective problem and the solutions to each of its 
latent/slack/splitting terms as proximal operators. 

Especially in cases where multiple majorization steps are taken (to solve for- 
say-tc and 2 ; in a Fsp problem) the use of proximal operators, their properties, and 
the associated hxed-point theory can simplify otherwise lengthy constructions and 
convergence arguments. As well, using the proximal operator’s properties, like the 
Moreau identity, one can move easily between the different objective problems and, 
thus, primal and dual spaces. It is also worth mentioning that the efficacy of certain 
acceleration techniques can depend on the objective problem (see Beck and Teboulle 
[2014]) and, similarly, the proximal steps taken. 

For a further connection to the general optimization literature, the quadratic term 
in the proximal operator can be seen as a quadratic penalty for a linear constraint in 
a Lagrangian. When a split objective is used, the application of a proximal operator 


= mm 

ti;>0 


+ l{x) + z^{Bx — w)} 

= lix) + z^Bx + min | d)(w) — z'^w} 
w>0 ^ J 

= l{x) -f z^Bx — (f)*{z) 

= Fpd{x,z) 


21 


results in an objective function that is very similar-or equivalent-to an augmented 
Lagrangian. Specifically, the addition of a squared term in the Fsp problem leads to 
the ADMM estimation technique in which one iterates through conditional solutions 
to X and 2 ; at each step, with solutions given by proximal points. Both Parikh and 
Boyd [2013] and Chen and Teboulle [1994] observe that, for the splitting/composite 
problem, the augmented Lagrangian for ADMM is 


4>{w) + l{x) + z^{Bx — w) + ^\\Ax — z\\'^ 
= Fsp(x,w,z) + x\\Ax - z||^ 


( 12 ) 


We can consider direct proximal solutions to some variables in this objective; for 
instance, z* = prox^^^^^ The objective is then 

Fsp{x,w,z*) + ^\\Ax-z*\\^ 

If z* as a function of w is linear, then it may be possible to take another proximal 
step, like w* = prox^sp(^,^,^*)/p(. • • )• 

We aren’t restricted to using the proximal operators directly implied by an ob¬ 
jective problem, such as those that appear when 1,1* and/or 0,0* are-or contain- 
quadratic terms in their arguments. Instead, one can apply a surrogate or approx¬ 
imation (e.g. envelopes, majorization/minorization) to terms within an objective 
problem and effectively induce a proximal operator. This could be done for the pur¬ 
poses of imposing or approximating a constraint, as in (12), or even for approximating 
solutions to such a constraint. Notice that the proximal step producing z* for (12) 
involves a composite argument. Ax. When exact solutions to the composite proximal 
operator aren’t available, one can consider “linearizing” —with — 


where A) < A^, yielding 


Fsp{x,w,z) + f^\\Ax - ^11^ < Fsp{x,w,z) 


\x — z\\ 


This approach can be seen as a simple majorization, and, when combined with 
the proximal solution for z, as a forward-backward envelope for the sub-problem. 
Implementations of this approach include the linearized ADMM technique, or the 
split inexact Uzawa method, and are described in the context of Lagrangians by 
Chen and Teboulle [1994] and primal-dual algorithms in Chambolle and Pock [2011]. 
Magmisson et ah [2014] details splitting methods in terms of augmented-Lagrangians 
for non-convex objectives. 

To demonstrate the framework described here, we give an example of how proximal 
operators, their properties, and these concepts can be used to derive an algorithm for 
a specihc objective problem. 


Example 5. For proper, convex I{x), (j){x) with Lipschitz continuous derivatives, we 
start with the primal-dual problem 


max inf {l{x) -|- z'^{Bx) — (j)*{z)} 

Z X ^ 
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and notice that the argmin for the sub-problem in x, l{x) + z^{Bx), is given by the 
fixed point, for A; > 0, 

X* = prox (x*) . 

\i (l{x)-\-z^Bx) 

By a property of proximal operators, namely 

prox (g) = prox(g — u) , (13) 

g(z)+u'^z 9 

for a generic function g{z) and variables q, z and u (obtained by completing the square 
in the definition of the operator) we have 

X* = prox (x*) = prox(a:* — XiB^z) . 

\l(l{x)+z'^ Bx) 

Now, we’re left with only the sub-problem in z, 

max {/(x*) + z^{Bx*) — (j)*{z)} = — min {(j)’’{z) — z^{Bx*) — /(x*)} . 

We can take yet another proximal step, for the minimization problem, (j)*{z)—z'^{Bx*), 
in z with constant X^. Using (13) and (4), we find that the argmin satisfies 

z* = prox( 2 ;* + X^pBx*) 

Next, let’s say we find that the proximal solution to 0* is problematic in some 
cases, yet the solution x* is still desirable. Using the Moreau decomposition in (4), 
we can easily derive an alternative for those cases: 

prox( 2 ;* + X^Bx*) = ^ i ^ - prox ) o {X^{z* + Bx*)) 

Hence, we have the following implied iterative algorithm: 

X* = prox(a;* - XiB'^z*) 
hi 

1 / \ (14) 

z* = — / - prox o + Bx')) 

\ <i>/>'4, j 

If we further separate the last step in (14) into two steps and simplify by setting 
Xi = X^ = 1, we arrive at 

X* = prox(a:* - B'^u*) 

i 

w* = prox(M* + Bx*) 

<P 

u* = u* - {w* - Bx*) . 

This has the basic form of techniques like alternating split Bregman, ADMM, split 
inexact Uzawa, etc., which demonstrates how versatile the proximal operator and it’s 
properties are when applied to the broad class of objective problems. The differences 
between approaches often involve assumptions on I and 0, such as Lipschitz continuity, 
and the exact order of steps. See Chen et ah [2013] for more details. 
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6.2 General Quadratic Composition 

Consider, now, the most general form of a quadratic objective 


argmin inf z) = K[z)x — rj {z)x + (j){Bx) 


(15) 


where A(z) > 0. Again, such forms can arise when one majorizes with a second-order 
approximation of l{x) around This also makes (15) the Moreau envelope dehned 
in (3). The general quadratic case, in which A(z) is not necessarily diagonal, can be 
addressed with splitting techniques. 

This form, when A(z) is symmetric positive dehnite, encompasses the approaches 
of Geman and Yang [1995], Geman and Reynolds [1992] Assuming B is positive 
dehnite, a proximal point solution can be obtained by setting l{x) = x'^A{z)x — rj'^x 
in (14). The general solution to a quadratic-form proximal operator-like (6)-is, again, 
given by 

prox(g) = {I + XiA{z))~^ {q + XiT]) 


which, together with the split-dual formulation, implies a proximal point algorithm 
of the form 


X* = prox(x* — XiB'^ z*) 

= {I + AA(Y))-' (x* - XiB^z* + XiT]) 

z* = ^ i I - prox j o {X^{z* + Bx*)) 

Xcj, \ 4>/x^ J 

We’ve now introduced the sub-problem of solving the following system of linear 
equations: 

(J -h XiA{z))q* = iq + XiT]) . 

Using the exact solution to the system of equations would rehect methods that in¬ 
volve Levenberg-Marquardt steps, quasi-Newton methods, and Tikhonov regulariza¬ 
tion, and is related to the use of second-order Taylor approximations to an objective 
function. Naturally, the efficiency of computing exact solutions depends very much 
on the properties of / -|- XiA{z), since the system dehned by this term will need to be 
solved on each iteration of a hxed point algorithm. When A( 2 ;) is constant, a decom¬ 
position can be performed at the start and reused, so that solutions are computed 
quickly at each step. For some matrices, this can mean only 0{n) operations per 
iteration. In general, however, the post-startup iteration cost is O(n^). 

Other approaches, like those in Chen et al. [2013], Argyriou et ah [2011] do not 
attempt to directly solve the aforementioned system of equations. Instead they use a 
forward-backward algorithm on the dual objective, Fpd. For simplicity, let A(z) = A 
be symmetric positive dehnite, and A = R its Cholesky decomposition. The 
Cholesky decomposition won’t be a required component in the resulting implied al¬ 
gorithm; it is used here for a simplihed exposition. 
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Starting with the split-dual objective for l{x) = Ax — rj'^x, 

{ X rr\ rji rj-i | 

-X Ax — r] X + z Bx — (f) {z)\ 

2 J 

= minmax | lr\\Rx — R~^{ri — B'^z)\\'^ — ]:\\R~^{r] — B^z)\\‘^ — • 

X z y2 2 j 

A solution for the problem in x is easily obtained from the proximal operator and is 
X* = A~^{ri — B'^z), or from the second line we could still arrive at the same solution 
via a first-order-or linearized- quadratic bound inspired by the 2-norm inequality 
||Mn|| < ||M||||n||. That is 

\\Rx — R~^{ri — B^z)\\‘^ < ||-R||^||a; — A~^{ri — B^z)\\‘^ 

< cr^eo,{.A)\\x - A~^{ri - B'^z)\\’^ 


Now, at X 


max 

Z 


= X* we have the following problem in z: 

- B^z)f - r(j)} = mm - R-^r,f + ^(z)} 


Again, we can use a forward-backward proximal solution to the above problem, where 
l{z) = ^\\R~^B^z — R~^t]\\‘^, so that 

Vl{z) = BR-^ {R-^B^z - R-^t]) = As {BA-^B^z - BA-^r]) , 

Then, with As > (Jniax(-B^~^-B^)/2, we can obtain z* as the proximal solution 

2 ;* = prox( 2 ; - AsV/( 2 ;)) 


A2 0* 


2;* = prox(2; - As (-BA ^B'^z + BA ^77)) 

= I / - prox ) o ((/ - X^BA-^B^) z + BA-^r]) 


(16) 


In sum, we have an implied proximal point algorithm similar to (14) that is, instead, 
based on a hrst-order forward-backward method. 

Example 6. A related example of this variety of split forward-backward algorithm is 
used by Argyriou et al. [2011], who apply Picard-Opial iterations given by 

Hk = kI {1 — k)H , 

for n G (0,1), to find a fixed point, v*, of the operator 


H(v) 


I — prox 


{BA-^p + (/ - -fBA-^B'^)v) 


,Vv e 
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where 0 < 7 < ‘Ijumax . The operator H is understood to he non-expansive, 

so, by Opial’s theorem, one is guaranteed convergence, and, when H is a contraction, 
this convergence is linear. After finding v*, one sets x* = A~^ (7 — xB"^v*^. 

Noting the similarities with (16), we see that v here can he interpreted as the 
dual variable z. What distinguishes this approach from others is that there are fewer 
upfront restrictions on the matrix operator B. Chen et al. [2013] discuss the number 
of iterations, k, in the process of finding the fixed point v* and detail a one-step 
algorithm with similar scope. 

7 Applications 

7.1 Logit loss plus Lasso penalty 

To illustrate our approach, we simulate observations from the model 

iVilPi) ~ Binom(J,pi) 

Pi = logit"^(afa;) 


where i = 1,..., 100, aj is a row vector oi A & j^iooxsoo^ ^ ^ J = 2. The A 

matrix is simulated from N(0,1) variates and normalized column-wise. The signal x 
is also simulated from N(0,1) variates, but with only 10% of entries being non-zero. 

Here m, are the number of trials, yi the number of successes and m = 'Y^=i 
total number of trials in the classihcation problem. The composite objective function 
for sparse logistic regression is then given by 


argmm 

X 


n 

^ |mi log(l -F - yiO^x^ 

i=l 


P 

+ A ^ \xj 
i=i 


To specify a proximal gradient algorithm all we need is an envelope such as those 
commonly used in Variational Bayes. In this example, we use the simple quadratic 
majorizer with Lipschitz constant A given by ||H'^H||2/4 = crmax(^)/4, and a penalty 
coefficient A set to 0.1(Tmax(^)- 

Figure 2 shows the (adjusted) objective values per iteration with and without 
Nesterov acceleration. We can see the non-descent nature of the algorithm and the 
clear advantage of adding acceleration. 


7.2 Logit Fused Lasso 

To illustrate a logit fused lasso problem, we compare a Geman-Reynolds inspired 
quadratic envelope for the multinomial logit loss and a fused lasso penalty with the 
standard Lipschitz-bounded gradient step. We dehne the following quantities 

n 

A(n) = 2 miX{aJv)aiaj = 2H'^diag(m ■ A(Hn))H 

i=l 
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Figure 2 : (Adjusted) objective values for iterations of the proximal gradient method, 
with and without acceleration, applied to a logistic regression problem with an L^- 
norm penalty. 


rf = -mi/2)af . 

i=\ 

Now we compute Xt, conditional on ta, for the envelope 

|mj log(l + = min < -x'^A{w)x — rfx + c{w) + '^\\D^^'^x 

i=l ^ ^ 

To do this, we employ the Picard-Opial composite method of Argyriou et ah [2011]. 

Simulations were performed in a similar fashion as Section 7.2 but with N = 
100, M = 400, m = 2 and where D^^^x has a fused lasso construction consisting 
of hrst-order differences of x. Figure 3 show the objective values for iterations of 
each formulation. With the use of second-order information, we have extremely fast 
convergence to the solution. 

For data pre-conditioning, we perform the following decompositions: A = U'EV'^, 
the singular value decomposition (SVD), A“^(n) = D~^A~'^^ where D = diag(m- 

A (An)). This implies that one SVD of A, or generalized inverse, is required to compute 
all future A“^(n) and thus providing computational savings. 

7.3 Poisson Fused Lasso 

To illustrate an objective that is not Lipschitz, but still convex, we use a Poisson 
regression example with a fused lasso penalty. We simulated a signal given from the 
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Figure 3: Objective values for iterations of two proximal composite formulations ap¬ 
plied to a multinomial logistic regression problem with a composite L^-norm penalty. 
Both are run until the same numeric precision is reached. 


model 


{y\x) Pois(exp(y4x)) 

p 

0(x) = \xj — Xj-i\ 

i=i 

In our simulation, the true sparse parameter vector x has 10% non-zero signals from 
N(0,1). The design matrix A G is also generated from N(0,1), then column 

normalized. 

In sum, we have a negative log-likelihood and regularization penalty of the com¬ 
posite form 

n p n 

F{x) = exp(afx) — Uiajx + \xj — Xj_i| = ^^exp(afx) — i/iajx + . 

2=1 j = l 2=1 

where Oj are the column vectors of A and D^^^x is the matrix operator of hrst-order 
differences in x. Since the Poisson loss function is not Lipschitz, but still convex, 
we replace the constant gradient step with a back-tracking line search. This can be 
accomplished with a back-tracking line search step. 

Figure 4 shows the objective value results for each method, with and without 
acceleration. An alternative approach is given by Green [1990], who describes an 
implementation of an EM algorithm for penalised likelihood estimation. 






0 25 50 75 100 


Figure 4: (Adjusted) objective values for iterations of the proximal gradient method, 
with and without acceleration, applied to a Poisson regression problem with a fused 
L^-norm penalty. 

7.4 L^-norm loss plus L^-norm penalty for 0 < q < 1 

A common non-convex penalty is the bridge norm, L'^-norm for 0 < g < 1. There are 
a number of ways of developing a proximal algorithm to solve such problems. The 
proximal operator of L'^-norm has a closed-form, multi-valued solution and conver¬ 
gence results are available for proximal methods in Marjanovic and Solo [2013] and 
Attouch et ah [2013]. For this example, we choose the former approach. 

The regularization problem involves hnd the minimizer of an L^-norm loss with 
an L'^-norm penalty for 0 < g < 1, 

f I 2 ^ 

x1: = argmin < -\\y — Ax\f‘ + A \xi\‘^ 

a; 2 

t J=1 

The component-wise, set-valued proximal L'^-norm operator is given by 

f 0 if \y\ < hx 

prox(?/) = {0,sgn(|/)a;A} if \y\ = hx 

[sgn(?/)x A \y\> hx 

where 

bx,q = (2A(1 - g))5^ 
h\,q = bx,q \qbl~g 
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X + Agx'' ^ = \y\,x G {bx,g, |x|) 

Attouch et al. [2013] describe how the objective for this problem is a Kurdyka- 
Lojasiewicz (KL) function, which provides convergence results for an inexact (multi¬ 
valued proximal operator) forward-backward algorithm given by 

G prox (x* - 7 f(A'^Ax* - A'^b)) . 

Interestingly, the KL convergence results for forward-backward splitting on appropri¬ 
ate non-convex continuous functions bounded below imply that the solution choice for 
multi-valued proximal maps-as in the L'^-norm case-does not affect the convergence 
properties. See Appendix D for more information. 

An alternative approach is the variational representation of the L'^-norm; however, 
this doesn’t satisfy the convergence conditions of Allain et al. [2006] within the half¬ 
quadratic framework. 

Marjanovic and Solo [2013] detail how cyclic descent can be used to apply the 
proximal operator in a per-coordinate fashion under a squared-error loss. The cyclic 
descent method is derived from the following algebra. First, a single solution to the 
squared-error loss minimization problem can be given for a component i of x, by 

0 = Vi/(x) = Aj{Ax -y) = AJ (A^x^ + A_iX_i - y) 

where A* is column i of A, and A_j,x_j have column/element i removed. Applied to 
a quadratic majorisation scheme we find that at iteration t 

t+i ^ Aj{y-A_iX^+^) ^ AJA , 

AfA, ||A||2 * 

with y — Ax* = r*. In a similar fashion to gradient descent, this involves 0{n) 
operations for updates of Afr*, so one cycle is 0{np). 

We simulate a data vector y G M"" from a regression model 

y = Ax -f ere where e N(0,1) 

with an underlying sparse parameter value x G M'* with n = 100, d = 256, in which 
the true sparse x has 5% non-zero signals generated from N(0,1). The design matrix 
A G is also generated from N(0,1) then column normalized. We set the 

signal-to-noise ratio at 16.5 to match the simulated example from Marjanovic and 
Solo [2013] which gives a = 0.0369. 

Figure 5 plots the mean squared error (MSE) versus the log-regularisation penalty 
and the power in the L'^-norm penalty. Essentially, this consists of contours of 
logio(MSE(x)) on a plot of 0 < g < 1 versus the amount of regularization logiQ(A). 
One interesting feature of this model is that the estimated regression coefficients x\ 
can jump to sparsity as 0 < g < 1 , and this will be illustrated in a regularized path 
for the next example. 
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Proximal SNR=16.5 (sigma=0.037), non-zero coef’s=13 
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Figure 5: Penalty weight, A, vs. MSE and q for a L^-norm error with an L'^-norm 
penalty, 0 < g < 1, estimated via cyclic descent and proximal solutions. 

7.5 Prostate Data 

As a practical example of our methodology, we consider the prostate cancer dataset, 
which examines the relationship between the level of a prostate specihc antigen and 
a number of clinical factors. The variables are log cancer volume (Icavol), log 
prostate weight (iweight), age (age), log of the amount of benign prostatic hy¬ 
perplasia (Ibph), seminal vesicle invasion (svi), log of capsular penetration (Icp), 
Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45). 

A common regularized approach is to use lasso and elastic net, see Tibshirani 
[1996] and in Zou and Hastie [2005], respectively. Alternatively, we £t the regulari- 
sation path using 



We can use the exact proximal operator for the L'^-norm and solve the harder non- 
convex problem. Figure 6 shows the regularisation path. The major difference is, 
again, in the jumps to a sparse solution. 

8 Discussion 

Proximal algorithms are a widely applied approach to solving optimization problems 
that provide an extension of classical gradient descent methods and have properties 
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that can be used to arrive at many different algorithmic implementations. They are 
iterative shrinkage methods that extend traditional EM and MM algorithms-which 
are presently commonplace in statistics. Beck and Sabach [2013] provide a historical 
perspective on iterative shrinkage algorithms by mainly focusing on the Weiszfeld 
algorithm [Weiszfeld, 1937]. The split Lagrangian methods described here were orig¬ 
inally developed by Hestenes [1969] and Rockafellar [1973]. More recently, there is 
work being done to extend the range of applicability of these methods outside of the 
class of convex functions to the broader class of functions satisfying the Kurdyka- 
Lojasiewicz inequality (see Attouch et ah [2013]). 

The purpose of our approach was to describe and apply the framework provided 
by proximal algorithms for constructing solutions to a large class of optimization 
problems in statistics. These problems often involve composite functions that are 
representable by a sum of a linear or quadratic envelope together with a function that 
has a closed-form proximal operator that is easy to evaluate. Numerous studies exist 
that demonstrate the efficacy and breadth of application of this approach. Micchelli 
et al. [2013, 2011] study proximal operators for composite operators for L^-norm and 
L^-norm/TV denoising models. Argyriou et al. [2011] describe numerical advantages 
of the proximal operator approach versus traditional fused lasso implementations. 
Chen et al. [2013] provides a further class of fixed point algorithms that advance the 
proximal approach in the composite setting. 

Many MM block descent algorithms converge very slowly and there are a number 
of tools available to speed convergence. The most common approach involves Nes¬ 
terov acceleration; see Nesterov [1983] and Beck and Teboulle [2004] who introduce 
a momentum term for gradient-descent algorithms applied to non-smooth composite 
problems. Attouch and Bolte [2009], Noll [2014] provide further convergence proper¬ 
ties for non-smooth functions. O’Donoghue and Candes [2012] use adaptive restart 
to improve the convergence rate of accelerated gradient schemes. Giselsson and Boyd 
[2014] show how preconditioning can help with convergence for ill-conditioned prob¬ 
lems. Meng and Chen [2011] modify Nesterov’s gradient method for strongly convex 
functions with Lipschitz continuous gradients. Allen-Zhu and Orecchia [2014] provide 
a simple interpretation of Nesterov’s scheme as a two step algorithm with gradient- 
descent steps which yield proximal (forward) progress coupled with mirror-descent 
(backwards) steps with dual (backwards) progress. By linearly coupling these two 
steps they improve convergence. Giselsson and Boyd [2014] show how precondition¬ 
ing can help with convergence for ill-conditioned problems. 

There are a number of directions for future research on proximal methods in 
statistics, for example, exploring the use of Divide and Concur methods for mixed 
exponential family models, and the relationship between proximal splitting and vari¬ 
ational Bayes methods in graphical models. Another interesting area of research 
involves combining proximal steps with MCMC algorithms [Pereyra, 2013]. 
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A Convergence 

We now establish convergence results for the forward-backward proximal solution to 
(15) given in (9) 

X* = prox{x — V/(a;)/A} , 

4>l\ 

when I and 0 are lower semi-continuous and V/ is Lipschitz continuous. We also 
assume that prox^y;^ is non-empty and can be evaluated independently in each com¬ 
ponent of y. 

Recalling the translation property of proximal operators stated in 13, we can say 
X* = prox (x — V/(x)/A) = prox (x) 

f T A 

= argmin < (f){z) -|- Vl{z) (^ — x) — ||x — ^ 

2 : ^ 

By the proximal operator’s minimizing properties, its solution x* satishes 
0(x*) -I- V/(x*)^(x* — a;) -I- ^||x — x*|p < 0(x) 
providing a sort of quadratic minorizer for F{w) in the form of 

l{w) + 0(x*) -f V/(x*)'^(x* — w) + ^\\w — x*|p < l{w) + (({w) = F{w) 

The Lipschitz continuity of V/(x), i.e. 

/(x) < l{w) + Vl{w)'^{x — w) + ^||x — , 

also gives us a quadratic majorizer 

F(x) = /(x) -I- 0(x) < l{w) + — x) -I- ^||x — x*||^ 

which, when evaluated at x = x* and combined with our minorizer yields 

(A — 7 )^||a:* — < F{w) — F(x*) 
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Thus, if we want to ensure that the objective value will decrease in this procedure, 
we need to £x A > 7 . Furthermore, functional characteristics of / and 0, such as 
convexity, can improve the bounds in the steps above and guarantee good-or optimal- 
decreases in F{w) — F{x*). 

Finally, when we compound up the errors we obtain a 0(1/k) convergence bound. 
This can be improved by adding a momentum term to y that includes the first deriva¬ 
tive information. 

These arguments can be extended to Bregman divergences by way of the general 
law of cosines inequality 

D{x^ w) = D{x^ z) + D{w^ z) + {'Vl{z) — Vl{w))'^{x — w) , 
so that D{x,w) > D{x, P{w)) + D{P{w),w) where P{w) = argmin^ iA(n, tc). 

B Nesterov Acceleration 

A powerful addition is Nesterov acceleration. Consider a convex combination, with 
parameter 9, of upper bounds for the proximal operator inequality z = x and 2 : = 
X*. We are free to choose variables = Ox + {1 — d)x'^ and w. If 0 is convex, 
(i){9x + (1 — 9)x'^) < d(p{x) + (1 — 9)(j){x~^), then we have 

F{x+) - F* - (1 - e){F{x) - F*) 

= F(a;+) - OF* - (1 - e)F{x) 

< \{x~^ — w)'^{9x* + (1 — 9)x — x~^) + ^ I \x^ ~ r 
= ^ (^||tc — (1 — 9)x — 9x*\\‘^ — llx’*' — (1 — 0)x — 



Where w is given in terms of the intermediate steps 

9u = w — {1 — 9)x 
6 u^ = x^ — {1 — 9)x 

Introducing a sequence 9t with iteration subscript, t. The second identity, 9u = 
a; — (1 — 9)x~, then yields an update for w as the current state x plus a momentum 
term, depending on the direction [x — x~), namely 

w = (1 - 9t)x + 6tu = X - - 9t){x - x~) 


C Quasi-convex Convergence 

Consider an optimisation problem where / is quasi-convex, continuous 

and has non-empty set of finite global minima. Let x^ be generated by the proximal 
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point algorithm 


X* G argmin 


l{x) + ^\\x — x^W^ 


Quiroz and Oliveira [2009] show that these iterates converge to the global minima, al¬ 
though the proximal operator at each step may be set-valued-due to the non-convexity 
of Z. A function I is quasi-convex when 


l{9x + (1 — 0)z) < max(Z(a;), l{z)) , 

which accounts for a number of non-convex functions like \x\‘^, when 0 < g < 1, and 
functions involving appropriate ranges of log(a;) and tanh(a;). In this setting, using 
the level-sets generated by the sequence, i.e. U = {x & dom(Z) : l{x) < inh Z(a;Q}, one 
hnds that f/ is a non-empty closed convex set and that x^ is a Fejer sequence of hnite 
length, — x*|| < cxd, and that it converges to a critical point of I as long as 

min {l{x) : a; G is nonempty. 


D Non-convex: Kurdyka-Lojasiewicz (KL) 

A locally Lipschitz function Z : — )■ M satishes KL at x* G if and only if 

G (0, oo) and a neighbourhood U of x* and a concave k : [0, rj] —)■ [0, oo) with 
k(0) = 0, k G C^, k' > 0 on ( 0 , 77 ) and for every x E U with l{x*) < Z(x) < l{x*) + rj 
we have 

k! {Z(a:) — Z(a;*)} dist (0, dl{x)) > 1 
where dist(0. A) = sup||a;|p. 

x^A 

The KL condition guarantees summability and therefore a hnite length of the 
discrete subgradient trajectory. Using the KL properties of a function, one can show 
convergence for alternating minimisation algorithms for problems like 

min L{x, z): = Z(x) -|- Q{x, z) + (/>(z) , 

X,Z 

where VQ is Lipschitz continuous (see Attouch et ah [2010, 2013]). A typical appli¬ 
cation involves solving min^-gjid {Z(x) - 1 - 0(x)} via the augmented Lagrangian 

L(x, z) = l{x) + 0(2;) A"^(x - 2 r) -h ^\\x - zW^ 

where p is a relaxation parameter. 

A useful class of functions that satisfy KL as ones that possess uniform convexity 

l{y) > Z(a;) -|- (z — x) + K\\z — a:||^, where p > 1 ,\/u E dl{x) . 

Then Z satishes KL on dom(Z) for k{s) = pK~psp . 

For explicit convergence rates in the KL setting, see [Frankel et ah, 2014]. 
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Figure 6: Proximal results for the prostate data example under the L^-norm penalty. 
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Type 


4i{x) 


prox^^(y) 


Laplace 
Gaussian 
Group-sparse, £p 


Gamma, Chi 
Double-Pareto 

Huber dist. 


u;||a:|| 

t||x|P 

K\\xr 

p = 4/3 

p = 3/2 
p = 3 

p = 4 

—Kina: -|- cox 
7 log(l -H |a:|/a) 

f Tx"^ \x\ < 

1 a;v^|a:| — a;^/2 otherwise 
W, T G (0, -l-oo) 


Max-entropy dist. u)\x\ + t\x\^ + n\x\^ 
2 ^pG (1,-hoo), 

W, T, K G (0, -l-oo) 

u}\x\ — ln(l -I- u:\x\) 


Smoothed-laplace 
dist. 

Exponential dist. 


Uniform dist. 


Triangular dist. 


Weibull dist. 


GIG dist. 


ujx x>Q 
- 1-00 a; < 0 

—(jJ X < —OJ 
X |a:| < w 

UJ X > UJ 

— \n(x — w) -|- ln(—w) X G (w, 0) 

— ln(w — x) -|- ln(a)) x G (0, Cj) 

- 1-00 otherwise 

ui G (—00,0], (h G (0,oo) 

{ —Klnx-l-wx^ x>0 
-1-00 X < 0 

p G (1, -l-oo) UJ,K G (—oo, 0] 

{ —Klnx-l-wx-l-p/x x>0 
-1-00 X < 0 

ijj,K,p G (—oo, 0] 


sgn(x) max(||x|| — a;, 0 ) 

x/( 2 t -I- 1 ) 

sgn(x)p, 

p S.t. p-\-pKff~^ = ||x|| 

X = 4_ 256k3/729 

X + 

9 k^ sgn(x) — a /1 -I- 16 |x|/( 9 k 2 )^ /8 
sgn(x) -I- 12 k|x| - 1 ^ /(6k) 

\ 8k J \ 8k J 

X = x/x^ + 1 /( 27 k) 

^ (^x — UJ + \/(x — uj)^ + 4 k ^ 

||x| -a+y/(a- |x |)2 -h 4 d(x)|, 
d(x) = (ajxj - 7 )+ 

jxj < a;(2r-I-l)/v^ 


2t+1 


X — uj\/^sgn(x) jxj > a;(2r-I-l)/v^ 


sgn(x) prox , ^ 

k\-\p/{2t+1) 


maxi X — uj. 


/ V Lo\x\—ljJ ^ —l + -\/fo|a:|——l|^+ 4 i.i;|a;| 


X — UJ X > U! 

0 X < UJ 

X — UJ X > U! 

0 X < UJ 

( x+cl:+,J\x—ui \^+4 , 

) - ^ - X < 1/UJ 


1 


x+ui—\/\x—uj\^+4 
- ^ - X > XjUJ 


TT S.t. pUJTrP + TT'^ — XTT = K 


S.t. TT^ + (uJ — x)tt‘^ — KTT = P 


Table 1: Sources: [Chaux et al., 2007] [Hu et al.] 
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Penalty 


Minimizer 


cj){t) = mins {Q{t, s) + V'(s)} 
|t|“,ae (1,2] 
y/a + t^ 

^ - log (l + 

^ |i|<o 
lajtj — ^ jtj > a 
log(cosh(at)) 

“i+W 


1 

i+Vx 


Q{t, s) = 


ajtj 


a-2 


\/ 

1 

a(a+|i|) 

J1 |t| < a 

i^i>« 

tanh(Q:f) 

^ t 

'-2 
sgn (t) 

— OO 

1 


2 ti{vi+iy 


for t = 0 
otherwise 
for t = 0 
otherwise 


Q{t,s) = {t-sy 


ct — 


ct — 




a(a+|i|) 


I (c- l)t 
1 ct — a sgn(t) 

ct — a tanh(Q;t) 


< a 
> a 


ct — 


ct — 


sgn (t) 

(R+iF 


2Vi{Vi+lY 


Table 2: Minimizers for the multiplicative form are a(t) = s , ^ ^ , and 

^ \(j){t)/t iit^O ’ 

for additive form a{t) = ct — 4> {t). See [Nikolova and Ng, 2005]. 



Error Rate 


Algorithm 

Convex 

Strongly Convex 

Per-Iteration Cost 

Accelerated Gradient 
Descent 

Oil/yfe) 

0(log(l/e)) 

0(n) 

Proximal Gradient 

Descent 

0(l/e) 

0(log(l/e)) 

0(n) 

Accelerated Proximal 
Gradient Descent 

o(i/Fi) 

0(log(l/e)) 

0{n) 

ADMM 

0(l/e) 

0(log(l/e)) 

0{n) 

Frank-Wolfe / 

Gonditional Gradient 
Algorithm 

0(l/e) 

o(i/v^) 

0{n) 

Newton’s Method 


0(loglog(l/e)) 

0{v?) 

Gonjugate Gradient 
Descent 


0{n) 

0{n^) 

L-BFGS 


Between 0(log(l/e)) 
and 0(loglog(l/e)) 

0{v?) 


Table 3: See [Dnckworth, 2014], 
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